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Experiments with networks of optogenetically altered neurons require stimulation with high 
spatio-temporal selectivity. Computer-assisted holography is an energy-efficient method for robust 
and reliable addressing of single neurons on the millisecond-timescale inherent to biologial infor- 
mation processing. We show that real-time control of neurons can be achieved by a CUDA-based 
hologram computation. 



I. INTRODUCTION 

The development of light-sensitive neurons has been 
a milestone in optogenetics [IJ. The ability to engineer 
a neuron's optical sensitivity by genetic manipulation is 
crucial for a non-destructive and fast, yet accurate photo- 
stimulation (PS) of individual sites in networks of living 
neurons. In vivo interaction with individual neurons is 
fundamental for a concise experimental study of such ba- 
sic neurological processes like the mechanisms of learning. 
In terms of energy efficiency and spatial resolution holo- 
graphic methods are considered to be the most suitable 
for PS [2 . 

Holograms, i.e. computer generated phase masks 
(PM), displayed on a spatial light modulator (SLM) real- 
ize pixel- wise phase retardations of a coherent laser beam. 
Upon illumination the intensity of the Fourier transform 
of the PM yields a high-resolution optical stimulation 
pattern (OSP) at the specimen. For a sketch of the ex- 
perimental setup see Fig. [T] The OSP follows from the 
subset of neurons selected for stimulation. By targeting 
specific neurons the neural activity in the network and 
thus its collective behavior can be influenced. 

The basis of neural activity is the generation of spikes 
in the membrane potential at the axon hillock due to 
synaptic input. The spikes travel along the axon to the 
synaptic connections to other nerve cells. In genetically 
altered neurons light-sensitive ion channels are expressed 
in the cell membrane. If lit with the correct frequency 
the ion channels open and thus change the membrane po- 
tential. This either inhibits or enhances spike generation. 
After transmission to the next neuron the spike adds to 
the synaptic input which may lead to another spike. 

For interactive modification of the spiking behavior the 
optical stimulus must be generated within the time a 
spike needs to travel from one neuron to another. In- 
terspike intervals of adjacent neurons are in the range of 
10-20 ms and set the time-scale for computing the un- 
known PM. Due to this severe time constraint multi-site 
stimulation thus had to use precomputed PMs, up to 



* Electronic address: |stkramer@math.uni-goettingen.de| 



now. For interactive network control PMs must be com- 
puted online which for frame rates in the required range 
of 0.1 to 1 kHz poses a substantial challenge. On cur- 
rent many- and multi-core processing units this requires 
extensive parallelization. 

Mathematically, computing a PM for a given OSP con- 
stitutes an inverse problem equivalent to wavefront re- 
construction (see [3] and references therein) and is an in- 
stance of the phase retrieval problem (PRP) in diffraction 
imaging [4 . Numerical approaches to the phase problem 
abound, but convergence results and global solutions are 
limited to special cases 16] that do not necessarily ap- 
ply to the case discussed in this paper. An arbitrary 
OSP is unlikely to have a phase-only Fourier transform. 
Thus our PRP is fundamentally inconsistent as defined 
in [7] . To account for the mathematical structure, a care- 
ful analysis of the performance of the parallelization tech- 
niques available and a strong focus on long-term software 
reusability distinguishes this work from others, e.g. [8]- 
10^. Useful approximations of a PM for a given OSP 
can be achieved by iterative algorithms like the widely 
used Method of Alternating Projections [11 , also known 
as Gerchberg-Saxton algorithm [12 . 

In this work we will combine parallel computation on 
graphics cards with C++-based generic programming 
and a sound mathematical theory. Only this combination 




FIG. 1: Holographic illumination of a network of 
optogenetically altered neurons. 
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of techniques allows to generate phase masks within less 
than 10 ms, matching the dynamics of neural activity. 



II. METHOD OF ALTERNATING 
PROJECTIONS 



The wavefront is to be altered by a phase shift at the 
finite number of pixels of the SLM. The entire sys- 
tem is modeled on a finite dimensional vector space. 
Let Lx^Ly > be the dimensions of the SLM and n^^, Uy 
the respective number of pixels. We seek a signal u G 
for N = Ux X Uy. The intensity distribution of the 
laser beam sets the amplitude of the wavefront u on 
the SLM. Assuming a constant intensity over one pixel, 
we discretize the intensity distribution by the nonnega- 
tive p G M^. Wavefronts u emanating from the SLM are 
given by the set of vectors 

S = {ueC^ : \ujk\ =Pjk, 

j = 1,2,. ..,na,, k = 1,2,. ..,n^}. (1) 

Propagation of the light through the lens system is mod- 
eled by Fraunhofer diffraction [13]. The light at the SLM 
is related to the observed OSP at the specimen by the 
Fourier transform F. Waves with modulus matching the 
amplitude distribution m G of the OSP form the set 



M = {ueC'^ : \{Fu 



jk\ 



m 



'jk-) 



j <nx, k < ny}.{2) 



On the one hand our wavefront must fulfill the amplitude 
constraint Eq. Q, on the other hand the amplitudes of 
its Fourier transform are fixed by the intensity distribu- 
tion of the stimulation pattern, Eq. Altogether, the 
mathematical problem we address is to 



Find ue SnM. 



(3) 



For a nonempty intersection the problem is defined to 
be consistent; otherwise inconsistent or ill-posed. A com- 
mon algorithm for problems of this type is the method 
of alternating projections [TTl |T2]. For a review of this 
and other projection-based approaches for the PRP see 
[6]. Algorithms of this kind are built on projection oper- 
ators onto the constraint sets S and M. By a projection 
of a point in a space X onto a subset C of that space, 
we mean the mapping of that point to the set of nearest 
points in C with respect to the norm induced by the real 
inner product on X. For general PRPs, it was proved 
in [3^ that 




if Ujk ^ 0; 



' ^ ^ (4a) 

Pjkexp{iO), for l9 G [0,27r) j ' 



M(ii)} (4b) 
are projections onto the sets S and M, respectively. 



where 



M{u) 



Vjk 



mj/e exp(i6>), for G [0, 27r) 



(4c) 

For given G C-'^ the method of alternating projections 
computes the iterates u" via 



,-+1 e PsPmu^ 



0,1,2,... 



(5) 



The multi-valuedness of Eq. Q makes Eq. (3| a non- 
convex feasibility problem [3 . Hence Eq. (p) must be 
understood as a selection from set- valued mappings. Due 
to nonconvexity, except in special cases ^, global con- 
vergence of Eq. ([5| cannot be guaranteed in general. For 
consistent PRPs local convergence results are available 
[7 . Yet, it is more the exception than the rule that our 
PRP will be consistent: a set of fixed amplitudes cannot 
produce an arbitrary OSP. Our numerical experiments 
indicate that our PRPs are indeed inconsistent as mea- 
sured by the magnitude of the gap 



G = \\Psu'' - PMW'h 



(6) 



between accumulation points in M and their projections 
onto S. The gap is measured in the standard Euclidean 
norm || • ||2. This systematic inconsistency is a major dif- 
ference between optogenetic PS and PRPs due to imaging 
experiments. In the latter the diffraction pattern com- 
prising the set M is causal^ that is, comes from diffrac- 
tion by a physical object, e.g. a protein crystal. Assum- 
ing that Eq. ([3| is inconsistent, we content ourselves with 
finding best approximation pairs (ix*, v*) G x with 
u* eS,v* e M, Pm^* V* and Psv* u\ 

To account for the particularities of optogenetic PS, 
we define the physical error as the sum of the pixel- wise 
relative violation of deviation tolerances between target 
and reconstructed OSP . We allow for a relative devi- 
ation = 0.1 for non-zero target pixels and an absolute 
deviation of t^ = 3 -10"^ for non-lit pixels. Violations are 
summed in multiples of t£ and td- With u^j^ being the in- 
tensity from the current iteration step, rrijk the intensity 
in the target OSP and O(-) the Heaviside step function, 
the total error is the sum of 
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(7b) 



III. UNIFIED IMPLEMENTATION 
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FIG. 2: Left: Software structure and its association with the different hardware components. Right: Class diagram 
of the PAAL concept. Class names are underlined. Template arguments are given in red boxes. Dashed boxes 

indicate partial template specializations. 



The compound system of CPU and GPU, each with ded- 
icated memories, represents a non-uniform memory ac- 
cess architecture with a very heterogeneous distribution 
of processing capabilities and internal transfer rates. Fig- 
ure [2al sketches the class structure and its distribution 
over the compound system of CPU and GPU. The ma- 
jor bottleneck is the PCIe-BUS. According to the PCIe 
v2.0 specification it has a maximal transfer rate of 8 
GByte/sec although in practice one rather gets 4 to 5 
GByte/sec. This will rise to 16 GByte/sec with the forth- 
coming PCIe v3.0, yet memory transfer rates within the 
GPU are of the order of 100 GByte/sec. On CPUs with 
integrated memory controllers the transfer rates are in 
the range of 20 to 30 GByte/sec. To anticipate the rapid 
evolution of hardware and parallelization techniques we 
spent considerable effort on modularizing the program 
using C++'s templating capabilities. CUDA extends C 
for programming NVidia CPUs. OpenMP provides mul- 
tithreaded parallelization on multi-core CPUs. Depend- 
ing on the parallelization technique the program works on 
different sides of the PCIe-BUS. To separate hardware- 
specific optimizations at the low-level, e.g. the details of 
Eq. Q, from the implementation of algorithms we intro- 
duced the concept of a parallel architecture abstraction 
layer (PAAL). Since most of our computational tasks are 
data-parallel they are perfect candidates for abstraction 
with respect to fioating-point precision and paralleliza- 
tion. This is achieved by a suitable set of template pa- 



rameters leaving the generation of the hardware-specific 
part of the code to the compiler. As FFTs in the projec- 
tor onto the constraint representing the OSP, Eq. (4b), 



we use either NVidia's cuFFT or the FFTW [15] which 
offers an OpenMP- as well as a pthreads-based variant. 

The aim of the PAAL concept is a quick recombina- 
tion of algorithms and parallelization strategies by ex- 
plicit template specializations. The front end comprises 
the user-interface (UI) and manages the execution of 
the phase retrieval algorithms for which separate driver 
classes exist, e.g. GS-DRIVER for the method of alter- 
nating projections. The final PM is transferred to an 
OpenGL framebuffer object for display on the SLM, cf. 
Fig. [2^ 

The PAAL concept is explained best by walking 
through the essential parts of its class structure. This 
is done roughly in a top down approach, i.e. from host 
to device and how things build on each other. A sketch 
of the class structure is given in Fig. l2b] At the top is 



the interface to the GS-Driver which is formed by the 
class HologramKernels. It takes two template arguments: t 
for the precision and arch for the architecture the algo- 
rithm is to run on. At the bottom of the hierarchy is the 
operation one has to do on a particular pixel. 

To express that the class HologramKernels is imple- 
mented with HologramKernelslmpI inheritance is private [16] 
(indicted by the dashed line in Fig. 2b). The class 



HologramKernels needs partial specializations for the differ- 
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ent architectures because for the CUDA kernels the wrap- 
per functions behave differently with respect to the ar- 
chitecture. On a NVidia GPU they have to call a CUDA 
kernel. On a CPU they have to either use OpenMP or 
pthreads for parallelization. The parallel execution of the 
per-pixel operation via OpenMP or pthreads can be done 
directly in the specialization of the wrapper function. In 
the following we omit the pthreads specialization as it 
is structurally very similar to OpenMP. The wrapper 
functions are implemented by the internal class impi of 
HoiogramKerneisimpi. The reason for this particular design 
is that the C++ standard does not allow to define par- 
tial specializations of (a subset of) the member functions 
of a class. This issue can be circumvented by introduc- 
ing an internal class with a dummy template parameter 
and to partially specialize its members. Within the class 
impi the particular type of real and complex numbers is 
deduced from the template parameter t by means of a 
suitable Traits structure. This is a typical example of 
template metaprogramming [T7] . 

The back end, i.e. the details of the implementa- 
tion, are stored in a separate source file to keep g++ 
away from CUDA-specific code. Within the evaluation 
of Eqs. (4a) and (4c) we need precision-dependent tol- 
erances for what is considered as zero. To this end we 
localize the inevitable magic numbers in a structure __eps 
and a function __is_zero. In case of being compiled into 
a CUDA kernel the __device__ keyword is put into effect 
indicating that the function can only be executed on the 
device, i.e. the GPU. Prepending __device__ by __host__ 
signals the compiler (i.e. nvcc) to compile two versions 
of a function or operator. One for the execution on the 
GPU and one for the CPU. At the binary level these are 
distinct functions. 

The actual per-pixel operation is done by an 
architecture- independent function __ps_eiement. Its argu- 
ments are a pointer d.devPtr* to the beginning of the array 
of pixels of the iterated image Fu^ ^ a pointer d.originai* 
to the beginning of the array of pixels of the original im- 
age and the lexicographic index of the pixel x. The CUDA 
kernel __ps (listingjl]) basically has the same arguments as 
the element function. The kernel gets the size of the im- 
age as additional argument in order to avoid operating 
on non-existent pixels. The kernel computes the posi- 
tion X of its pixel from its threadidx and Biockidx. Given 
the pixel position is within the bounds of the array the 
element function is invoked. 

Listing 1: PAAL: CUDA kernel for amplitude adaption 

template <typename T> 

__global__ void __ps (T *d_devPtr , const T 
*d_original , const int size) 

{ 

int X = blockDim . x*blockIdx . X + threadidx. x; 
if (x < size ) 

__ps_element <T , gpu_ cuda > ( d_de vPt r , d_original , 
x) ; 

} 



The missing link between back end and driver class is 
the specializations of the wrapper functions for the ker- 
nels. The GPU version (listing [2| starts as many threads 

as there are pixels for the kernel ps. Each thread 

starts the device function ps_eleinent, so that each 

pixel (vector element) is processed. 

Listing 2: PAAL: GPU specialization of wrapper function 

template <typename T> 
template < typename dummy > 
void 

HoiogramKerneisimpi <T>: :Impl <gpu_cuda , 
dummy > : : ps 

(Complex *d_devPtr , const Complex *d_original , 
const int size) 

{ 

int threads_per_block = 512; 

int blocks = (size + threads_per_block - 1) / 
threads_per_block ; 

__ps <T><<<blocks , 

threads_per_block >>>(d_devPtr , d_original , 
size ) ; 

cudaThr eadSynchr onize () ; 
} 

The CPU-OpenMP version (listing [3| has a f or-loop 
over all pixels in the image which is parallelized by 
an OpenMP preprocessor directive. By declaring the 

ps.element to be a host device function, 

we can call the same function from the CPU as from the 
GPU but without the intermediate kernel layer. In this 
way we have the actual computation implemented only 
once. With the individual specialized classes wrapped 
around this implementation we can choose our comput- 
ing precision and hardware. 

Listing 3: PAAL: CPU specialization of wrapper function 

template < typename T> 
template < typename dummy > 
void 

HoiogramKerneisimpi <T>: : Impl<cpu, dummy > : : ps 
(Complex *d_devPtr , const Complex *d_original , 
const int size) 

{ 

#pragma omp parallel for private (i) 
for(int i = 0; i < size; i++) 

__ps_element <T , cpu > ( d_de vPt r , 
d_or iginal , i ) ; 

} 



Finally, we have to provide full template specializations 
of all the combinations of precision and architecture tem- 
plate parameters we want to work with. This must be at 
the end of the hardware-specific source file as all functions 
have to be declared and their bodies defined before the 
class can be explicitly instantiated by the compiler [18]. 
The explicit specializations are necessary since we com- 
pile the back end with a different compiler than the front 
end of the program. 



5 




5 10 15 20 25 

n iterations 

FIG. 3: Convergence for a spot pattern of the physical error and the constraint gap. 



IV. RESULTS 

The OSP for benchmarking the computation of the 
phase masks for photo-stimulation are bright spots on 
a dark background (Fig. [3|. The limits of spatial res- 
olution in the reconstruction is tested on the Siemens 
star (Fig. [4|. In both cases we use Pm^^ as initial PM 
where is the 1-bit target OSP. Thus, our initial condi- 
tion is computed by taking the Fourier transform of the 
OSP, adapting the Fourier coefficients to the amplitude 
constraints on the SLM and transforming back into real 
space again. The other obvious choice as initial condi- 
tion would be to use random phases. The resulting phase 
masks do not differ significantly from those obtained by 



using Pmu^ but take longer to converge. Therefore we 
skip them in the following discussion. 



A. Benchmarking 

For interactive holographic PS the physical er- 
ror, Eq. ([7|, must reach a sub-threshold level within in- 
terspike intervals, i.e. 10 to 20 ms. Hence, the first is- 
sue is which parallelization technique provides sufficient 
performance to meet this requirement. The GPU-based 
computations were done on a Tesla C2070. The CPU- 
based ones using OpenMP or pthreads were run on a 
two-socket system with X5650 Xeons. The CPUs have 
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FIG. 4: Siemens star: Convergence for single and double precision. 



six cores each. Therefore we decided to use 12 threads, 
i.e. as much as there are physical cores. 

Computing a single PM on the CPU takes several sec- 
onds no matter whether OpenMP or pthreads are used. 
When using CUDA and thus the GPU the total runtimes 
match the interspike interval constraint. For a typical 
resolution of 800 x 600 for an SLM the computation of 25 
iterations in single precision takes 45 ms including trans- 
fer of the given OSP to the GPU (1 ms). The iteration 
essentially converges after one step (Fig. [3|. Hence a 
reasonable OSP is available already after much less than 
10 ms. However, the precise figure depends on the prob- 
lem size and number of iteration steps. Thus we keep 
10 ms as a conservative bound. The left panel of Fig. |5] 
shows GPU runtimes per iteration in total and broken 
down into the contributions due to FFT (green and cyan 
bars) and enforcement of amplitude constraints (blue and 
grey bars) for different image sizes and 25 iterations of 
Eq. ([5|. The runtimes are further subdivided into the re- 
sults for single and double precision indicated by the red 
and magenta bars. The proportion of work to be done in 



the FFT increases with problem size as the FFT is of log- 
linear complexity. Enforcing the amplitude constraints 
is linear in the problem size as each pixel is visited only 
once per iteration and exchange of information between 
different pixels is not required. 

The right panel of Fig. |5] shows the speedups of the 
CUDA implementation over its OpenMP and pthreads 
counterparts. On average CUDA is 50 times faster per 
iteration than the 12-thread CPU variants. The perfor- 
mance gain per iteration solely depends on the size of an 
OSP. For the Fermi architecture used in the Tesla C2070 
the floating point performance in double precision is half 
of the one for single precision. This is due to the fact that 
a double is twice as large as a float and thus requires 
twice as much memory bandwidth. On CPUs this is less 
of an issue since they focus on hiding memory latencies 
by branch prediction. This is reflected by the fact that 
for double precision the speedup is roughly only half of 
the one for single precision. Yet, this suffices to get OSPs 
in double precision within the limits set by the interspike 
intervals. 
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FIG. 5: Left: GPU runtimes per iteration for various image sizes and 25 iterations of Eq. ([5|. Right: Speedup per 

iteration for various image sizes. 



B. Precision and Convergence 

The second issue is the influence of the floating point 
precision e on convergence and performance. Figure |3] 
shows the convergence and reconstruction results for a 
spot pattern as it would be used in a PS experiment. Fig- 
ure |4] summarizes the results for the Siemens star which is 
a standard test image for the spatial resolution achieved 
by reconstruction algorithms. 

The reconstructed OSPs are given as inset with a log- 
arithmic gray scale for intensity. The convergence curves 
represent the behavior of the physical error, Eq. ([7|), and 
the gap, Eq. j6|, with respect to the number of iterated 
steps in Eq. (jsT. We are interested in the influence of the 
hardware architecture and the precision. Hence the con- 
vergence history is given for single and double precision 
on GPU and CPU. The details of the curves for the gap 
and for the physical error in the insets reveal that the 
behavior primarily depends on whether the computation 
is run in single or double precision but not on the archi- 
tecture. This is a subtle effect on the order of the single 
precision accuracy as illustrated by the scaling of the or- 
dinates in the insets. Both figures show that convergence 
of the PM in single is as good as in double precision as 
each quantity has a unique limit value independent of the 
precision. 

The intensity plots of the reconstructed OSPs indicate 
that the contrast between lit and unlit areas is 3 orders 
of magnitude. A look at the center of the Siemens star 
shows that structures down to a few pixels can be re- 
solved. The insets of figures [3] and [4] demonstrate that 
the gap, as defined in Eq. (|6|, and the physically moti- 



vated error, Eq. ([7|), saturate within one iteration indi- 
cating the inconsistency of our PRP. All further changes 
are 0{e). Convergence does not depend on hardware as 
the limit values of error and gap are several orders of 
magnitude larger than any precision. Depending on e we 
expect the following resolution limits for G. 



Our number of pixels is of the order of 10^. As- 
suming statistical independence for the errors of u^j^ we 
get as theoretical limit Gth oc lO^e, i.e. 10~^ for sin- 
gle precision (e = 10-^) and 10"^^ for double preci- 
sion (e = 10~^^). An interesting phenomenon reflecting 
the difference between exact and finite precision arith- 
metic is revealed by comparing the convergence behavior 
as function of e. Single precision (blue and red curves) 
cannot resolve the inconsistency of the PRP, i.e. whether 
or not 5' n M = 0, as G ^ Gth- According to [7] this 
should improve convergence. The downside is, that while 
the PRP appears to be consistent from a numerical point 
of view, larger e means worse approximation of the pro- 
jection operators. For double precision (green and ma- 
genta curves) we get more accurate projection operators 
but at the same time the gap is resolved as for both 
precisions G is of similar magnitude. This renders the 
PRP inconsistent again, justifying our assumption of in- 
consistency. Our results also show that, despite a rather 
large G the method of alternating projections does not 
suffer from stagnation at bad local minima which oth- 
erwise would call for more sophisticated algorithms like 
RAAR [19 . 
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V. CONCLUSIONS 

Mathematically, computing a phase-only hologram to 
create an OSP which selects predetermined neurons is 
equivalent to the problem of wavefront reconstruction. 
Useful approximations of a PM for a given OSP can only 
be achieved by iterative algorithms like the widely used 
Method of Alternating Projections. 

From the point of view of software engineering we have 
shown how to integrate CUBA into a complex software 
environment in a modular way without sacrificing per- 
formance. The high modularity of our simulation frame- 
work has several key advantages. Code redundancy is 
minimized. The template techniques let the code refiect 
the mathematical structure of the problem. The effort 
to switch between the three parallelization techniques 
tested, CUBA, OpenMP and pthreads, is reduced to a 
single word in a single line of code and can be done ei- 
ther at compile or at run time. The framework makes it 
easy to implement other reconstruction algorithms and 
to apply it to other problems of wavefront reconstruc- 
tion totally unrelated to the presented test case from the 
field of optogenetics. For instance, we could integrate the 



ideas discussed by Thalhammer et al. for speeding up the 
switching of liquid crystal SLMs [20 . Typical switching 
times are of the order of 10 ms and thus may interfere 
with the spiking dynamics of the neurons. 

Finally, only the CUBA-based implementation is ca- 
pable of the necessary frame rates for stimulating net- 
works of optogenetically altered neurons on their intrin- 
sic timescale of several ms. Our results show that at most 
5 iterations suffice to compute a phase mask within less 
than 10ms, matching the time-scale of the dynamics of 
neural activity. 
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